fix(grafanactl): reconcile stale URLs and delete orphaned Grafana datasources #258

Open
cssjr wants to merge 9 commits into Azure:main from cssjr:fix/reconcile-stale-grafana-datasource-urls

Conversation

@cssjr cssjr commented Jun 25, 2026

Summary

  • The modify datasource reconcile command only managed Azure Monitor Workspace integrations (resource IDs on the Grafana ARM resource) but never checked the actual datasource URLs or removed orphaned datasources in Grafana.
  • After integration reconciliation, the command now:
    • Updates stale datasource URLs when the AMW PrometheusQueryEndpoint hostname has changed (fixes DNS resolution errors)
    • Deletes orphaned Managed_Prometheus_* datasources whose workspaces no longer exist (this folds the cleanup previously done by the separate clean fixup-datasources command into the pipeline step)
  • Orphan detection uses workspace existence, not endpoint presence, to avoid a delete-recreate loop with workspaces that exist but haven't provisioned a Prometheus endpoint yet.
  • Both operations respect --dry-run.
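
The decision rules in the summary above can be sketched as a small pure function. This is only an illustration, not the code in cmd.go: the DataSource type, the Action values, and the assumption that a datasource name is the Managed_Prometheus_ prefix followed by the workspace name are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// managedPrefix is the datasource name prefix mentioned in the summary.
const managedPrefix = "Managed_Prometheus_"

// DataSource is a hypothetical, trimmed-down view of a Grafana datasource.
type DataSource struct {
	Name string
	URL  string
}

// Action is the reconcile decision for a single datasource.
type Action string

const (
	ActionIgnore Action = "ignore" // not a managed Prometheus datasource
	ActionKeep   Action = "keep"   // URL is current, or no endpoint is provisioned yet
	ActionUpdate Action = "update" // workspace exists but the URL is stale
	ActionDelete Action = "delete" // workspace no longer exists
)

// classify decides what to do with ds. The workspaces map is keyed by
// workspace name (assumed naming scheme); an empty value means the workspace
// exists but has no Prometheus query endpoint yet.
func classify(ds DataSource, workspaces map[string]string) Action {
	if !strings.HasPrefix(ds.Name, managedPrefix) {
		return ActionIgnore
	}
	endpoint, exists := workspaces[strings.TrimPrefix(ds.Name, managedPrefix)]
	switch {
	case !exists:
		// Orphan detection keys on existence, not on endpoint presence.
		return ActionDelete
	case endpoint == "" || endpoint == ds.URL:
		return ActionKeep
	default:
		return ActionUpdate
	}
}

func main() {
	workspaces := map[string]string{
		"ws-a": "https://ws-a-new.example.com",
		"ws-b": "", // exists, endpoint not provisioned yet
	}
	datasources := []DataSource{
		{Name: "Managed_Prometheus_ws-a", URL: "https://ws-a-old.example.com"},
		{Name: "Managed_Prometheus_ws-b", URL: "https://ws-b.example.com"},
		{Name: "Managed_Prometheus_ws-gone", URL: "https://ws-gone.example.com"},
		{Name: "Loki", URL: "https://loki.example.com"},
	}
	for _, ds := range datasources {
		fmt.Printf("%s: %s\n", ds.Name, classify(ds, workspaces))
	}
}
```

Returning "keep" for a workspace that exists without an endpoint is what avoids the delete-recreate loop: the datasource is neither deleted nor compared until an endpoint appears.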

Fixes: ARO-27914, AROSLSRE-1347, AROSLSRE-585

Test plan

  • cd tools/grafanactl && go build ./... — compiles cleanly
  • cd tools/grafanactl && go vet ./... — no issues
  • Dry-run against dev Grafana — correctly identifies stale URLs and orphaned datasources
  • Live run against dev Grafana — 24 stale URLs updated, 10 orphaned datasources deleted
  • Verification re-run — 0 stale, 0 orphaned, 49 current (idempotent)
  • Verify dashboards still load after changes

🤖 Generated with Claude Code

@cssjr cssjr marked this pull request as ready for review June 25, 2026 21:54
Copilot AI review requested due to automatic review settings June 25, 2026 21:54

Copilot AI left a comment

Contributor

Pull request overview

This PR enhances grafanactl modify datasource reconcile so it not only reconciles Azure Monitor Workspace (AMW) integrations on the Managed Grafana ARM resource, but also detects and fixes stale Managed_Prometheus_* Prometheus datasource URLs in Grafana by aligning them with each workspace’s current PrometheusQueryEndpoint.

Changes:

  • Added a Grafana client helper to update an existing datasource via the Grafana API.
  • Wired a Grafana API client into the modify datasource reconcile command execution path.
  • Implemented datasource URL reconciliation logic for Managed_Prometheus_* Prometheus datasources, honoring --dry-run.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tools/grafanactl/internal/grafana/client.go | Adds UpdateDataSource wrapper to support updating datasources via Grafana API. |
| tools/grafanactl/cmd/modify/options.go | Instantiates and carries a Grafana API client alongside existing ARM clients. |
| tools/grafanactl/cmd/modify/cmd.go | Collects AMW query endpoints and updates stale Managed_Prometheus_* datasource URLs (supports dry-run). |

Comment thread tools/grafanactl/cmd/modify/cmd.go Outdated
@cssjr cssjr force-pushed the fix/reconcile-stale-grafana-datasource-urls branch from fdc4ac3 to ba01314 Compare June 26, 2026 03:52
The modify datasource reconcile command only managed Azure Monitor
Workspace integrations (resource IDs) but never checked the actual
datasource URLs in Grafana. When an AMW Prometheus query endpoint
hostname changes, datasource URLs become stale and dashboards fail
with DNS resolution errors.

After integration reconciliation, the command now lists Grafana
datasources, compares each Managed_Prometheus_* URL against the
current AMW PrometheusQueryEndpoint, and updates any that differ.

Fixes: ARO-27914

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cssjr cssjr force-pushed the fix/reconcile-stale-grafana-datasource-urls branch from ba01314 to dc0e6a9 Compare June 26, 2026 03:59
…URL updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 26, 2026 04:02

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread tools/grafanactl/internal/grafana/client.go
Comment thread tools/grafanactl/cmd/modify/cmd.go Outdated
workspace.Properties.Metrics.PrometheusQueryEndpoint == nil {
continue
}
if *workspace.Properties.ProvisioningState == armmonitor.ProvisioningStateSucceeded {

Collaborator

Workspaces can intermittently be in another provisioning state (something like Updating).

Author

It now skips workspaces that are Failed or Canceled but includes workspaces in any other state (Succeeded, Updating, Creating). If a workspace is Creating and doesn't have an endpoint yet, it gets skipped later at the nil check on PrometheusQueryEndpoint.
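
A minimal sketch of the state filter described in this reply. The helper name isTerminalFailureState appears in a later commit message in this thread, but its signature is not shown there, so the one below is an assumption. ProvisioningState is a local string type standing in for the SDK enum, and treating a nil state as non-terminal is also an assumption.

```go
package main

import "fmt"

// ProvisioningState is a local stand-in for the ARM provisioning state enum.
type ProvisioningState string

const (
	StateSucceeded ProvisioningState = "Succeeded"
	StateCreating  ProvisioningState = "Creating"
	StateUpdating  ProvisioningState = "Updating"
	StateFailed    ProvisioningState = "Failed"
	StateCanceled  ProvisioningState = "Canceled"
)

// isTerminalFailureState reports whether a workspace ended in a state it
// will not leave on its own. A nil state is treated as non-terminal
// (assumption), so such workspaces are not skipped.
func isTerminalFailureState(state *ProvisioningState) bool {
	if state == nil {
		return false
	}
	return *state == StateFailed || *state == StateCanceled
}

func main() {
	states := []ProvisioningState{
		StateSucceeded, StateCreating, StateUpdating, StateFailed, StateCanceled,
	}
	for _, s := range states {
		fmt.Printf("%-9s skip=%t\n", s, isTerminalFailureState(&s))
	}
}
```

Filtering on "is it a terminal failure" rather than "is it Succeeded" keeps workspaces that are briefly Updating or Creating in the set, which is the concern raised in the review comment.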

cssjr and others added 3 commits June 26, 2026 09:25
…asources

The modify datasource reconcile command only managed Azure Monitor
Workspace integrations (resource IDs) but never checked the actual
datasource URLs or removed orphaned datasources in Grafana.

After integration reconciliation, the command now:
- Updates datasource URLs when the AMW PrometheusQueryEndpoint has
  changed (fixes DNS resolution errors from stale hostnames)
- Deletes orphaned Managed_Prometheus_* datasources whose workspaces
  no longer exist

Both operations respect --dry-run. This consolidates the cleanup
previously handled by the separate clean fixup-datasources command
into the pipeline-integrated reconcile step.

Fixes: ARO-27914

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
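
One way to honor --dry-run for both operations is to build a list of planned changes first and gate only the apply step. The sketch below is illustrative: the Op type and the execute function are hypothetical names, and error handling is left out to keep the focus on the dry-run gate.

```go
package main

import "fmt"

// Op is a hypothetical planned change against Grafana.
type Op struct {
	Kind string // "update" or "delete"
	Name string // datasource name
}

// execute reports every planned operation and calls apply only when dryRun
// is false. It returns the number of operations that were applied.
func execute(ops []Op, dryRun bool, apply func(Op)) int {
	applied := 0
	for _, op := range ops {
		if dryRun {
			fmt.Printf("[dry-run] would %s %s\n", op.Kind, op.Name)
			continue
		}
		fmt.Printf("%s %s\n", op.Kind, op.Name)
		apply(op)
		applied++
	}
	return applied
}

func main() {
	ops := []Op{
		{Kind: "update", Name: "Managed_Prometheus_ws-a"},
		{Kind: "delete", Name: "Managed_Prometheus_ws-gone"},
	}
	calls := 0
	apply := func(Op) { calls++ } // stand-in for the Grafana API calls

	n := execute(ops, true, apply)
	fmt.Printf("applied=%d api_calls=%d\n", n, calls)
}
```

Because the plan is identical in both modes, the dry-run output lists exactly the changes a live run would attempt.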
…dpoint presence

A workspace that exists but has no PrometheusQueryEndpoint yet would
be kept as an integration (causing Grafana to maintain its datasource)
but treated as orphaned by datasource reconciliation (causing deletion).
This created a delete-recreate loop.

Now orphan detection checks whether the workspace exists at all, and
only skips URL comparison for workspaces without endpoints yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tale-grafana-datasource-urls

Adds orphaned datasource deletion and fixes workspace existence
check to avoid delete-recreate loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cssjr cssjr changed the title fix(grafanactl): reconcile stale Grafana datasource URLs during modify fix(grafanactl): reconcile stale URLs and delete orphaned Grafana datasources Jun 26, 2026
@cssjr cssjr requested a review from Copilot June 26, 2026 18:56

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread tools/grafanactl/cmd/modify/cmd.go
Comment thread tools/grafanactl/cmd/modify/cmd.go
Workspaces can be in transitional states like Creating or Updating
intermittently. Only exclude workspaces in terminal failure states
(Failed, Canceled) from integration and orphan detection, so that
workspaces being updated are not temporarily removed and recreated.

getWorkspaceEndpoints still requires Succeeded since transitional
workspaces may not have a valid PrometheusQueryEndpoint yet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread tools/grafanactl/cmd/modify/cmd.go
Comment thread tools/grafanactl/internal/grafana/client.go
cssjr and others added 2 commits June 26, 2026 12:40
- Include datasource name and ID in UpdateDataSource error messages
  for easier troubleshooting
- Collect all reconciliation errors (both deletes and updates) using
  errors.Join instead of returning on the first update failure, which
  would silently drop accumulated delete errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
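
The error-collection change in the commit above can be sketched as follows. This is not the PR's code: reconcile, del, and upd are hypothetical names, and the error messages carry only the datasource name. errors.Join requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoom is a sample failure used by the demo in main.
var errBoom = errors.New("boom")

// reconcile runs every delete and every update, collecting failures instead
// of returning on the first one, so earlier delete errors are never dropped.
func reconcile(deletes, updates []string, del, upd func(name string) error) error {
	var errs []error
	for _, name := range deletes {
		if err := del(name); err != nil {
			errs = append(errs, fmt.Errorf("delete datasource %q: %w", name, err))
		}
	}
	for _, name := range updates {
		if err := upd(name); err != nil {
			errs = append(errs, fmt.Errorf("update datasource %q: %w", name, err))
		}
	}
	// errors.Join returns nil when there is nothing to join.
	return errors.Join(errs...)
}

func main() {
	fail := func(string) error { return errBoom }
	err := reconcile([]string{"ds-orphan"}, []string{"ds-stale"}, fail, fail)
	fmt.Println(err)
	fmt.Println("wraps errBoom:", errors.Is(err, errBoom))
}
```

The joined error still supports errors.Is and errors.As against each wrapped failure, so callers do not lose detail by receiving a single value.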
A workspace transitioning through Updating still has its
PrometheusQueryEndpoint from the previous Succeeded state. Use
isTerminalFailureState instead of requiring Succeeded, and rely on
the existing nil guard for workspaces that genuinely lack an endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 26, 2026 19:45

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread tools/grafanactl/cmd/modify/cmd.go
…r exist

Orphan detection now includes all workspaces regardless of
provisioning state. A workspace in Failed or Canceled state still
exists as an Azure resource, so its datasource should be preserved —
it may help identify broken clusters through the Grafana UI, and the
pipeline will restore everything when the workspace is fixed.

Datasources are only deleted when the workspace is truly gone (not
returned by the API at all). Provisioning state filtering remains
only for integration reconciliation and endpoint URL updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
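
The split described in the commit above amounts to building two lookups from one workspace listing. The sketch below assumes a hypothetical flattened Workspace type with a string state; the real code works on the ARM SDK types, which are not shown in this thread.

```go
package main

import "fmt"

// Workspace is a hypothetical flattened view of an Azure Monitor Workspace.
type Workspace struct {
	Name     string
	State    string  // e.g. "Succeeded", "Updating", "Failed", "Canceled"
	Endpoint *string // PrometheusQueryEndpoint; nil until provisioned
}

// partition builds two lookups from a single listing:
//   - existing: every workspace the API returned, used for orphan detection
//   - endpoints: non-failed workspaces with an endpoint, used for URL updates
func partition(workspaces []Workspace) (map[string]bool, map[string]string) {
	existing := make(map[string]bool)
	endpoints := make(map[string]string)
	for _, ws := range workspaces {
		// Any returned workspace still exists, whatever its state.
		existing[ws.Name] = true

		if ws.State == "Failed" || ws.State == "Canceled" {
			continue // state filtering applies to URL updates only
		}
		if ws.Endpoint == nil {
			continue // nil guard: nothing to compare against yet
		}
		endpoints[ws.Name] = *ws.Endpoint
	}
	return existing, endpoints
}

func main() {
	ep := "https://ws-a.example.com"
	existing, endpoints := partition([]Workspace{
		{Name: "ws-a", State: "Succeeded", Endpoint: &ep},
		{Name: "ws-b", State: "Failed", Endpoint: &ep},
		{Name: "ws-c", State: "Creating"},
	})
	fmt.Printf("existing=%d endpoints=%d\n", len(existing), len(endpoints))
}
```

A datasource whose workspace appears in the first lookup is never deleted, even if the workspace is absent from the second.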

@raelga raelga left a comment

Contributor

/lgtm
/approve

janboll commented Jun 29, 2026

Collaborator

/lgtm
/approve

janboll commented Jun 29, 2026

Collaborator

/hold

5 participants